Mining and Modeling Relations between Formal and Informal Chinese Phrases from Web Corpora

Authors

  • Zhifei Li
  • David Yarowsky
Abstract

We present a novel method for discovering and modeling the relationship between informal Chinese expressions (including colloquialisms and instant-messaging slang) and their formal equivalents. Specifically, we propose a bootstrapping procedure to identify a list of candidate informal phrases in web corpora. Given an informal phrase, we retrieve contextual instances from the web using a search engine, generate hypotheses of formal equivalents from these data, and rank the hypotheses using a conditional log-linear model. The log-linear model incorporates, as feature functions, both rule-based intuitions and data co-occurrence phenomena (either an explicit or indirect definition, or formal and informal usages occurring in free variation in a discourse). We test our system on manually collected test examples and find that the formal-informal relationship discovery and extraction process achieves an average 1-best precision of 62%. Given the ubiquity of informal conversational style on the internet, this work has clear applications for text normalization in text-processing systems, including machine translation, that aspire to broad coverage.
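The ranking step described in the abstract can be illustrated with a minimal sketch of a conditional log-linear scorer. This is not the authors' implementation: the function name `rank_hypotheses`, the feature names (`rule_match`, `cooccurrence`), the weights, and the placeholder candidates are all assumptions made for illustration. Each hypothesis receives a score equal to the weighted sum of its feature values, and the scores are normalized over the candidate set with a softmax.

```python
import math


def rank_hypotheses(features_by_hypothesis, weights):
    """Rank candidate formal equivalents with a conditional log-linear model.

    features_by_hypothesis: dict mapping each hypothesis to a dict of
        feature name -> feature value (hypothetical features).
    weights: dict mapping feature name -> weight (hypothetical values).
    Returns a list of (hypothesis, probability), highest probability first.
    """
    # Linear score: dot product of the weights and the feature values.
    scores = {
        hyp: sum(weights.get(name, 0.0) * value for name, value in feats.items())
        for hyp, feats in features_by_hypothesis.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores.values())
    exp_scores = {hyp: math.exp(s - max_score) for hyp, s in scores.items()}
    normalizer = sum(exp_scores.values())
    ranked = [(hyp, e / normalizer) for hyp, e in exp_scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates and feature values, for illustration only.
candidates = {
    "candidate_a": {"rule_match": 1.0, "cooccurrence": 0.8},
    "candidate_b": {"rule_match": 0.0, "cooccurrence": 0.5},
}
weights = {"rule_match": 1.5, "cooccurrence": 2.0}

ranked = rank_hypotheses(candidates, weights)
best_hypothesis, best_probability = ranked[0]
```

In this sketch, 1-best precision would be measured by checking whether `best_hypothesis` matches the reference formal phrase for each test item. In a trained model the weights would be learned from data rather than set by hand.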


Similar resources

On the unsupervised analysis of domain-specific Chinese texts.

With the growing availability of digitized text data both publicly and privately, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a larg...


Mining Videos from the Web for Electronic Textbooks

We propose a system for mining videos from the web for supplementing the content of electronic textbooks in order to enhance their utility. Textbooks are generally organized into sections such that each section explains very few concepts and every concept is primarily explained in one section. Building upon these principles from the education literature and drawing upon the theory of Formal Con...


Anxiety symptoms associated with problematic smartphone use severity: The mediation role of COVID-19 anxiety

Background: The Coronavirus Disease 2019 (COVID-19) pandemic is regarded as the biggest global health crisis in recent decades. The changes in major life domains due to infection control strategies resemble the functional impairment consequential to emotional distress and place many people at greater risk of psychiatric conditions. Meanwhile, the COVID-19 pandemic and associated social distanci...


Causal Knowledge Modeling for Traditional Chinese Medicine using OWL 2

Unlike Western Medicine, those in Traditional Chinese Medicine (TCM) are based on inherent rules or patterns, which can be considered as causal links. Existing approaches tend to apply computational methods on semantic ontology to do knowledge mining, but it cannot perfectly make use of internal principles in TCM. When it comes to knowledge representation, we can transform this inherent knowled...


Applying ontology design patterns to the implementation of relations in GENIA

Motivation: Annotated reference corpora such as the GENIA corpus play an important role in biomedical information extraction. A semantic annotation of the natural language texts in these reference corpora using formal ontologies and logic is challenging due to the ambiguous use of natural language and natural language semantics. Providing formal definitions and axioms for these relations would ...




Publication date: 2008